Prosper Loan Data Analysis by Carlos Blanco

This dataset is from prosper and contains data for 113937 loans they made from 2006 to 2014. The main goal is to determine what variables set the interest paid by the borrowers (BorrowerRate)

Univariate Plots Section

First, we see the variables names, classes and summaries.

## 'data.frame':    113937 obs. of  10 variables:
##  $ BorrowerRate           : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ EstimatedReturn        : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.: int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ EmploymentStatus       : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ CreditScoreRangeLower  : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper  : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ DebtToIncomeRatio      : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange            : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ LoanOriginalAmount     : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate    : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##   BorrowerRate    EstimatedReturn  ProsperRating..numeric.
##  Min.   :0.0000   Min.   :-0.183   Min.   :1.000          
##  1st Qu.:0.1340   1st Qu.: 0.074   1st Qu.:3.000          
##  Median :0.1840   Median : 0.092   Median :4.000          
##  Mean   :0.1928   Mean   : 0.096   Mean   :4.072          
##  3rd Qu.:0.2500   3rd Qu.: 0.117   3rd Qu.:5.000          
##  Max.   :0.4975   Max.   : 0.284   Max.   :7.000          
##                   NA's   :29084    NA's   :29084          
##       EmploymentStatus CreditScoreRangeLower CreditScoreRangeUpper
##  Employed     :67322   Min.   :  0.0         Min.   : 19.0        
##  Full-time    :26355   1st Qu.:660.0         1st Qu.:679.0        
##  Self-employed: 6134   Median :680.0         Median :699.0        
##  Not available: 5347   Mean   :685.6         Mean   :704.6        
##  Other        : 3806   3rd Qu.:720.0         3rd Qu.:739.0        
##               : 2255   Max.   :880.0         Max.   :899.0        
##  (Other)      : 2718   NA's   :591           NA's   :591          
##  DebtToIncomeRatio         IncomeRange    LoanOriginalAmount
##  Min.   : 0.000    $25,000-49,999:32192   Min.   : 1000     
##  1st Qu.: 0.140    $50,000-74,999:31050   1st Qu.: 4000     
##  Median : 0.220    $100,000+     :17337   Median : 6500     
##  Mean   : 0.276    $75,000-99,999:16916   Mean   : 8337     
##  3rd Qu.: 0.320    Not displayed : 7741   3rd Qu.:12000     
##  Max.   :10.010    $1-24,999     : 7274   Max.   :35000     
##  NA's   :8554      (Other)       : 1427                     
##           LoanOriginationDate
##  2014-01-22 00:00:00:   491  
##  2013-11-13 00:00:00:   490  
##  2014-02-19 00:00:00:   439  
##  2013-10-16 00:00:00:   434  
##  2014-01-28 00:00:00:   339  
##  2013-09-24 00:00:00:   316  
##  (Other)            :111428

I observe that I need to transform some variables. First the Prosper rating is a number and we will need it as factor, also the loan origination date is in datetime format and I proceed to extract years to study the changes over time.

I will combine Credit Score Range Upper and Lower into an Average because they measure the same variable and rearrange the order of Income Levels for a better plot visualization.

Also, I observe that a lot of variables have NAs, so in order to examine data with all of the variables I will remove rows with NAs.

ld$LoanOriginationDate.year = substr(ld$LoanOriginationDate, 1, 4)

ld$CreditScoreAverage <- (ld$CreditScoreRangeLower + ld$CreditScoreRangeUpper)/2

ld$ProsperRating..factor. <- factor(ld$ProsperRating..numeric.)

ld$IncomeRange <- factor(
  ld$IncomeRange,
  levels(ld$IncomeRange)[c(8,1,2,4,5,6,3,7)])

ld.no_na <- ld[complete.cases(ld),]

summary(ld.no_na)
##   BorrowerRate    EstimatedReturn    ProsperRating..numeric.
##  Min.   :0.0400   Min.   :-0.18160   Min.   :1.000          
##  1st Qu.:0.1349   1st Qu.: 0.07408   1st Qu.:3.000          
##  Median :0.1845   Median : 0.09060   Median :4.000          
##  Mean   :0.1934   Mean   : 0.09529   Mean   :4.132          
##  3rd Qu.:0.2524   3rd Qu.: 0.11500   3rd Qu.:5.000          
##  Max.   :0.3600   Max.   : 0.26670   Max.   :7.000          
##                                                             
##       EmploymentStatus CreditScoreRangeLower CreditScoreRangeUpper
##  Employed     :65884   Min.   :600           Min.   :619          
##  Full-time    : 7585   1st Qu.:660           1st Qu.:679          
##  Other        : 3526   Median :700           Median :719          
##  Retired      :  320   Mean   :699           Mean   :718          
##  Part-time    :  199   3rd Qu.:720           3rd Qu.:739          
##  Self-employed:   42   Max.   :880           Max.   :899          
##  (Other)      :    1                                              
##  DebtToIncomeRatio         IncomeRange    LoanOriginalAmount
##  Min.   : 0.0000   $50,000-74,999:24030   Min.   : 1000     
##  1st Qu.: 0.1500   $25,000-49,999:22023   1st Qu.: 4000     
##  Median : 0.2200   $100,000+     :14019   Median : 8000     
##  Mean   : 0.2588   $75,000-99,999:13644   Mean   : 9269     
##  3rd Qu.: 0.3200   $1-24,999     : 3840   3rd Qu.:14500     
##  Max.   :10.0100   Not employed  :    1   Max.   :35000     
##                    (Other)       :    0                     
##           LoanOriginationDate LoanOriginationDate.year CreditScoreAverage
##  2013-11-13 00:00:00:  459    Length:77557             Min.   :609.5     
##  2014-01-22 00:00:00:  443    Class :character         1st Qu.:669.5     
##  2013-10-16 00:00:00:  412    Mode  :character         Median :709.5     
##  2014-02-19 00:00:00:  404                             Mean   :708.5     
##  2014-01-28 00:00:00:  312                             3rd Qu.:729.5     
##  2013-09-24 00:00:00:  298                             Max.   :889.5     
##  (Other)            :75229                                               
##  ProsperRating..factor.
##  1: 5739               
##  2: 8617               
##  3:12824               
##  4:16890               
##  5:14631               
##  6:13705               
##  7: 5151

Next, I looked at the histograms for the BorrowerRate with 3 bins sizes to determine the shape of the distribution, I selected bins of 1%, 2,5% and 0.5%.

In addition, I created a Box Plot

To get a better understanding I plot the mean, median and interquartile measures in the 1% bin size histogram.

From the previous plot we observe that the distribution is close to a normal one but has a spike at about 0.32, which moves the mean and the median from the center of the curve. This must be further investigated.

Univariate Analysis

The dataset is in a wide format, with each row containing the information for a single loan, the loan and the borrower characteristics

The main feature of interest is the Borrower Rate, I want to be able to predict what rate a certain costumer would get.

In order to do this, I will take a closer look at the variables of related to the borrowers like credit score, income range, debt to income ratio, the amount of the loan and the rating prosper gives to each borrower.

Also, I created a credit average score, because it made sense to compare the data to a single credit than an upper and lower score. Also, I extracted the year from the loan origination date because I wanted to analyze the loans over the years and not a more granular time metric. Finally, I created a factor variable of the prosper ratings to be able to clearly differentiate the different borrowers in the plots.

Bivariate Plots Section

I will use GGpairs as my first plot to get an overview of the relationship between the variables of study.

Observations:

-Borrower Rate seems to be a normal distribution with a peak on a higher rate -Correlation Between Borrower rate and other variables: High: ProsperRating..numeric. (-0.953), EstimatedReturn (0.827). Medium: CreditScoreAverage (-0.529), LoanOriginal Amount (-0.405). Low: DebtToIncomeRatio (0.126) -ProsperRating..discrete. seems to be the biggest influence in Borrower Rates, since borrower rate boxplots are higher for lower ratings. -EmploymentStatus and IncomeRange does not seems to have much influence on Borrower Rate. - There is not any clear linear relationship between BorrowerRate and other variable

Firt I created various plots to analyze the proster rating vs te borrower rate

## # A tibble: 7 x 8
##   ProsperRating..factor. borrower_rate_min borrower_rate_1st_q
##                   <fctr>             <dbl>               <dbl>
## 1                      1            0.1779              0.3134
## 2                      2            0.1518              0.2712
## 3                      3            0.1158              0.2287
## 4                      4            0.0895              0.1765
## 5                      5            0.0693              0.1414
## 6                      6            0.0500              0.0999
## 7                      7            0.0400              0.0699
## # ... with 5 more variables: borrower_rate_mean <dbl>,
## #   borrower_rate_median <dbl>, borrower_rate_3rd_q <dbl>,
## #   borrower_rate_max <dbl>, n <int>

From the previous plots, we can see that prosper rating is indeed a very influential variable in the borrower rate. In fact, we can see that the borrower rate mean, median, interquartile distributions, min and max decrease as the rating goes up.

Also, it is very clear that most of the borrower with rating 1 get approximately 0.31 Rate. Them costumers with rating 2 have a greater spread than those with rating 3, and this repeats up to rating 7, where the interquartile is very close to the median.

In addition, when I plotted the histograms faceted by rating we can see that all ratings but rating = 1 have a close to normal distribution.

Next, I proceed to study the influence of other variables on the Borrower Rate

Credit Score Average

Loan Origination Date

Debt to Income Ratio

Loan Original Amount

Estimated return

Employment Status

Income Range

Bivariate Analysis

The result of the previous plots clearly shows that almost all of the variables have little to no effect on the borrower rate, this includes de date, the amount, the employment status, and debt to income.

The principal factor is the prosper rating. From the client side, besides the rating only the income range, loan amount and credit score seems to have some influence on the rate, decreasing it as the conditions are more favorable.

In addition, the expected return plot is interesting, there seems to be some linear relationships, another variable might help clear this out.

Multivariate Plots Section

Since it is clear that the main factor on the borrower rate is the prosper rating, I will try to discover other relationships between this rating and the other variables differencing each rating in the previous plots.

Credit Score Average vs Borrower Rate by Prosper Rating

Loan Origination Date vs Borrower Rate by Prosper Rating

Debt to Income Ratio vs Borrower Rate by Prosper Rating

Loan Original Amount vs Borrower Rate by Prosper Rating

Estimated Return vs Borrower Rate by Prosper Rating

Prosper Rating vs Borrower Rate by Income Range

Prosper Rating vs Borrower Rate by Employment Status

Multivariate Analysis

First, we can conclude that neither LoanOriginationDate, DebtToIncomeRatio, LoanOriginalAmount, IncomeRange nor EmploymentStatus have any clear relationship with the Borrower Rate.

Also, we can see that prosper rating is influenced by credit score average, however, the spread is big, with very different credit scores having the same rating.

In addition, we can see that over the years the rates for the ratings groups was maintained.

Lower ratings tend to have bigger debt to income rating, however this is not a rule, there are people with rating 1 and close to 0 debt to income ratio.

Lower ratings tend to get lower loan amounts, and as the loan amount increases, we see that the amounts tend to be round numbers, there are several loans made with amount of 10,000, 15,000, 20,000, 25,000, 30,000 and 35,000.

There are linear relationships for the estimated return and borrower rate when we group the data by rating, this describes that the bank calculates the higher rate the higher the return will be, a relationship closes to +1. However, we can also see that even with higher rates, the lower ratings are expected to return very low. Especially costumers with prosper rating of 1 tend to create loses. In addition, there seems to be another variable that divides that group, because multiple linear relationships within the group.


Final Plots and Summary

Plot One

Description One

In the first Plot, I observe that the distribution of the borrower rates (interest) for the loans its similar to a normal distribution if rating 1 is omitted, because almost of the costumer with this range get about 0.32 rate. Also, the other groups seem are in order and clearly separated, best ratings obtains bests rates. This is measured by the correlation, which is -0.953, a very strong relationship

Plot Two

Description Two

This plot is interesting because it demonstrated that better credit scores tend to get better borrower rates, however the relationship is not strong with a correlation of -0.529. Also, the fact that same credit score averages tend to have several costumers with different ratings. For example, we can find all the ratings in good amount in the 730-credit score average bin. This is very surprising for me because I thought that this will be a very influential variable on borrower rate and prosper rating. We can conclude that there are other variables influencing the rating and borrower rate.

Plot Three

Description Three

This final plot stands out because even thus I did not find any linear relationships in the other variables, here we see that if we divide each group of ratings, we could create several linear relationships.

This describes that the bank calculates that the higher rate the return will be higher too, a relationship close to +1. However, we can also see that this is the case for the ratings besides 1, where even the higher ratings are expected to return lower.

In conclusion costumers with prosper rating 1 tend to create loses. Also, there there seems to be another variable that divides that group, because multiple linear relationships may be within the group.


Reflection

Data analysis is a process where you first you have to understand each variable and the problem before doing any plots, because if you don’t understand the situation you can’t formulate the questions that will get the answers you need.

It is very important to clean up the data, because some variables might have classes or formats that will produce bad visualizations and findings. After that decide which is your main characteristic of interest, what are you trying to predict and study that variable alone, only after you have an understanding of it you can explore the relationships it has with the other variables.

Create plots for all the variables and try to find key findings that will lead your investigation until you find an explanation of the phenomenon’s discovered.

On a final note, I believe that the data is as valuable as it is real or trustworthy. Even thus is it from a great source the analyst must be aware of the limitations of the data due to how was is recollected, from who and under what circumstances.